@TinasheMTapera commented Oct 2, 2025

This PR creates a pytask pipeline for diurnal aggregation, which aggregates the data into night and day values based on the position of the sun. It is separate from the original snakemake + hydra pipeline, but the repo contains both.

Review Instructions

To review this PR, please first clone the repo and install the package in a clean conda environment:

conda create -n NAME python=3.12
conda activate NAME
pip install -e .

Then, symlink the data (bld is the directory pytask projects use for built data):

ln -s /n/dominici_lab/lab/data_processing/csph-era5_sandbox/bld [YOUR DIRECTORY]/bld

Then, open the docs website to read the notebooks explaining the functionality. You can do this by right-clicking _docs/index.html in VSCode and selecting "Show Preview". Alternatively, you can run the notebook code in the notes folder (the two are identical).

Notebooks to review:

  1. Pytask demo
  2. Pytask config
  3. Pytask download, pytask aggregate

Next, you can test out pytask in your terminal. Due to the large number of tasks, this can take up to 10 minutes to run.

# see the current status of all the tasks; may run long
pytask build --dry-run

# to filter tasks by a specific run, e.g. downloading 2010 data, use -k with boolean expressions
pytask build --dry-run -k "download and 2010"

# understand where the tasks come from
pytask collect

# with more detail
pytask collect --nodes

Then, you can delete a file from bld and submit a pipeline job to regenerate it. There is already an sbatch script set up to run this in parallel:

sbatch pytask.sbatch

Improvements that could be made:

  • We could reduce the number of tasks by looping over the data catalog in groups of rows instead of individual rows, though there is a tradeoff between speed and the number of tasks (a rough sketch of this follows below).
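For illustration, grouping the catalog rows into chunks could look roughly like the following Python sketch (the jobs_df columns, chunk_size, and process_rows are hypothetical stand-ins, not the pipeline's actual names):

import pandas as pd

# Hypothetical jobs table; in the real pipeline this would come from the data catalog.
jobs_df = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011],
    "variable": ["t2m", "tp", "t2m", "tp"],
})

chunk_size = 2  # number of rows handled by a single task

# Split the jobs into groups of rows; each group would become one pytask task
# instead of one task per row, trading per-task granularity for fewer tasks.
chunks = [jobs_df.iloc[i:i + chunk_size] for i in range(0, len(jobs_df), chunk_size)]

def process_rows(rows: pd.DataFrame) -> None:
    # Placeholder for the real per-row download/aggregation work.
    for _, row in rows.iterrows():
        print(f"processing year={row.year} variable={row.variable}")

for chunk in chunks:
    process_rows(chunk)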

Closes #16 #3 #17 #19 #20

…se resolves Add CC-BY License to the ERA5 dataset #20
…ublish datasets to the harvard dataverse. Also first attempt at nbdev with quarto
…tration of how to use `pytask` to manage data processing tasks in a Pythonic way, leveraging the power of decorators and type hints to define tasks and their dependencies
- Tested out pytask for building pipelines
- Used the pytask data catalog to create sets of tasks as parameters to functions using namedtuples (see the parametrization sketch after this list)
- Used the pytask data catalog to manage the parallelization of tasks
- Created a pytask logger to log the progress of tasks
- Implemented the download step of querying the ERA5 dataset in pytask
- Began implementation of the aggregation step in pytask:
    - Used the astral library to find the time of sunrise and sunset for each data point in a query
    - Assigned a diurnal class to each data point based on the time of day
    - Aggregation of data points by date and diurnal class in progress
- Adopted Quarto for documentation and notebooks, making use of [this nbdev PR](AnswerDotAI/nbdev#1521) that allows fully `.qmd`-driven packages
- Converted all `ipynb` files to `.qmd` format
- Used nbdev_docs to generate the documentation website
- Adopted a logger that solves #3
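As a rough illustration of the namedtuple-based parametrization mentioned above, tasks can be repeated over parameter sets with the @task decorator (this sketch assumes the pytask >= 0.4 API; the Job fields and output paths are invented for the example and are not the pipeline's real catalog entries):

from collections import namedtuple
from pathlib import Path
from typing import Annotated

from pytask import Product, task

# Hypothetical parameter sets; in the pipeline these come from the data catalog.
Job = namedtuple("Job", ["year", "variable"])
JOBS = [Job(2010, "t2m"), Job(2010, "tp"), Job(2011, "t2m")]

for job in JOBS:

    @task(id=f"{job.variable}-{job.year}")
    def task_download(
        job: Job = job,
        path: Annotated[Path, Product] = Path(f"bld/{job.variable}_{job.year}.nc"),
    ) -> None:
        # Placeholder for the real CDS query; writes a stub file so pytask sees a product.
        path.write_text(f"{job.variable} for {job.year}")

Each loop iteration registers a separate task with its own id, so pytask build -k can select tasks individually (e.g. -k "2010").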
This commit includes significant updates to the ERA5 data processing pipeline, focusing on using
and demonstrating `pytask` as our workflow management tool. Key changes include:

- Deleted obsolete log files for various datasets from 2015, 2017, 2019, 2021, and 2024.
- Removed unnecessary Hydra configuration files and logs from the 2025-03-17 run.
- Updated SLURM batch script to reduce maximum runtime from 18 hours to 6 hours.
- Added the pytask `config.py`, introducing a demo data catalog and adjusting the data catalog structure.
- Introduced the query object in `task_download.py` to handle data queries more effectively.
- Added `task_aggregate.py` with a modified function to convert netCDF to GeoTIFF.
- Refactored `task_download.py` to improve query handling and logging.
- Cleaned up imports and improved code organization across multiple modules.
- Updated documentation comments to reflect recent changes and maintain clarity.
- Added nbdev Quarto website documentation files.
…each xarray classified and resampled, but we need to convert to raster and then aggregate by polygon... not clear how to do this yet
… use DataFrame for diurnal classification.

WIP: Continue trying to figure out how to rasterize xarray data so that they work with the polygon_to_raster_cells function.
First, find the classifications of each point using sun position, then create two copies of the dataset with NaNs in the masked values, then resample by day. Importantly, you must set the time zone to the local time zone for the resampling by day to work correctly.
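A minimal sketch of that classify-mask-resample idea for a single grid cell, using astral for sunrise and sunset (the coordinates, column names, and time zone here are illustrative; the actual pipeline operates on xarray datasets rather than a one-cell DataFrame):

import pandas as pd
from astral import Observer
from astral.sun import sun

# Hypothetical hourly values for one grid cell; the real data come from the ERA5 netCDF.
times = pd.date_range("2010-07-01", periods=48, freq="h", tz="UTC")
df = pd.DataFrame({"t2m": range(48)}, index=times)
lat, lon = 42.36, -71.06  # example grid cell

def diurnal_class(ts):
    # Sunrise and sunset for this timestamp's date at the cell's location.
    s = sun(Observer(latitude=lat, longitude=lon), date=ts.date(), tzinfo=ts.tz)
    return "day" if s["sunrise"] <= ts <= s["sunset"] else "night"

df["diurnal"] = [diurnal_class(ts) for ts in df.index]

# Two copies with NaNs in the masked values, then resample by local calendar day.
local = df.tz_convert("America/New_York")  # local time zone so "day" boundaries line up
day_only = local["t2m"].where(local["diurnal"] == "day")
night_only = local["t2m"].where(local["diurnal"] == "night")
daily = pd.DataFrame({
    "day_mean": day_only.resample("D").mean(),
    "night_mean": night_only.resample("D").mean(),
})
print(daily)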
Parameterization now works well: pandas is used to create a dataframe of all combinations of parameters, the combinations that don't apply are filtered out, and the result is a single jobs dataframe that can be iterated over in the task function (illustrated below).
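For illustration, building such a jobs dataframe could look roughly like this (the parameter names and the filtering rule are made up for the example):

import itertools
import pandas as pd

# Hypothetical parameter grid; the real pipeline derives this from its configuration.
years = [2010, 2011]
variables = ["t2m", "tp"]
months = ["01", "02"]

jobs = pd.DataFrame(
    list(itertools.product(years, variables, months)),
    columns=["year", "variable", "month"],
)

# Drop combinations that don't apply (illustrative rule only).
jobs = jobs[~((jobs.variable == "tp") & (jobs.year == 2010))].reset_index(drop=True)

# Each row is then one unit of work iterated over in the task function.
for row in jobs.itertuples(index=False):
    print(row)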
…kes a row from the jobs dataframe as input, which makes it easier to manage parameters.

- The algorithm splits the data into day and night based on local time, which is determined from the longitude of the grid cell (see the sketch after this list).

- Remaining steps: change the query to use the new jobs dataframe, and update the notebook to reflect these changes; run and test the entire workflow to ensure everything works as expected; merge the aggregations into a single file per calendar month.
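A minimal illustration of the longitude-to-local-time idea (roughly 15 degrees of longitude per hour of offset from UTC); this is a solar-time approximation for the sketch, not necessarily the pipeline's exact implementation:

import datetime as dt

def approx_local_time(utc_time, longitude):
    # 360 degrees / 24 hours = 15 degrees of longitude per hour.
    offset_hours = longitude / 15.0
    return utc_time + dt.timedelta(hours=offset_hours)

# Example: 12:00 UTC at 75 degrees West is roughly 07:00 local solar time.
print(approx_local_time(dt.datetime(2010, 7, 1, 12, 0), longitude=-75.0))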
- Separated the qmd and ipynb files for notes and processing to test pipeline integrity
- Refactored `config.py` to enhance the data catalog structure and improve query handling; the data catalog now uses dataframes to manage jobs.
- Updated `download.py` to improve the download process and added checks for existing files (sketched after this list).
- Improved `pytask_logger.py` for better logging setup.
- Enhanced `task_aggregate.py` to optimize aggregation tasks and ensure proper output handling.
- Updated `task_data_preparation.py` to improve task definitions and exports.
- Refined `task_download.py` to include checks for existing downloads and improve logging.
- Updated Jupyter Notebook metadata to enable execution of all cells.
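As a hedged illustration of the existing-file check (download_if_missing and its arguments are hypothetical names, not the functions actually defined in download.py or task_download.py):

from pathlib import Path

def download_if_missing(target: Path, query: dict) -> Path:
    # Skip the expensive CDS request when the output already exists.
    if target.exists():
        print(f"Skipping {target}: already downloaded")
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Placeholder for the actual CDS API call that writes `target`.
    target.write_text(str(query))
    return target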
- Added a new core module for internal functions and testing, including utilities for path expansion, dynamic function importing, and directory structure creation.
- Implemented a Google Drive authentication class for fetching healthshed files.
- Created a ClimateDataFileHandler class to manage different file types from the Climate Data Store (CDS).
- Added a testAPI function to validate API connections and configurations.
- Updated aggregation module to use a specific example file for testing.
- Refactored various notebooks to improve clarity and execution flow.
- Removed unnecessary execution flags from multiple notebooks.
- Enhanced the task_aggregate.py script to include raster calculations and aggregation to healthsheds.